-
-
- RFC: 816
-
-
-
- FAULT ISOLATION AND RECOVERY
-
- David D. Clark
- MIT Laboratory for Computer Science
- Computer Systems and Communications Group
- July, 1982
-
-
- 1. Introduction
-
-
- Occasionally, a network or a gateway will go down, and the sequence
-
- of hops which the packet takes from source to destination must change.
-
- Fault isolation is that action which hosts and gateways collectively
-
- take to determine that something is wrong; fault recovery is the
-
- identification and selection of an alternative route which will serve to
-
- reconnect the source to the destination. In fact, the gateways perform
-
- most of the functions of fault isolation and recovery. There are,
-
- however, a few actions which hosts must take if they wish to provide a
-
- reasonable level of service. This document describes the portion of
-
- fault isolation and recovery which is the responsibility of the host.
-
-
- 2. What Gateways Do
-
-
- Gateways collectively implement an algorithm which identifies the
-
- best route between all pairs of networks. They do this by exchanging
-
- packets which contain each gateway's latest opinion about the
-
- operational status of its neighbor networks and gateways. Assuming that
-
- this algorithm is operating properly, one can expect the gateways to go
-
- through a period of confusion immediately after some network or gateway
-
- has failed, but one can assume that once a period of negotiation has
-
- passed, the gateways are equipped with a consistent and correct model of
-
- the connectivity of the internet. At present this period of negotiation
-
- may actually take several minutes, and many TCP implementations time out
-
- within that period, but it is a design goal of the eventual algorithm
-
- that the gateway should be able to reconstruct the topology quickly
-
- enough that a TCP connection should be able to survive a failure of the
-
- route.
-
-
- 3. Host Algorithm for Fault Recovery
-
-
- Since the gateways always attempt to have a consistent and correct
-
- model of the internetwork topology, the host strategy for fault recovery
-
- is very simple. Whenever the host feels that something is wrong, it
-
- asks the gateway for advice, and, assuming the advice is forthcoming, it
-
- believes the advice completely. The advice will be wrong only during
-
- the transient period of negotiation, which immediately follows an
-
- outage, but will otherwise be reliably correct.
-
-
- In fact, it is never necessary for a host to explicitly ask a
-
- gateway for advice, because the gateway will provide it as appropriate.
-
- When a host sends a datagram to some distant net, the host should be
-
- prepared to receive back either of two advisory messages which the
-
- gateway may send. The ICMP "redirect" message indicates that the
-
- gateway to which the host sent the datagram is no longer the best
-
- gateway to reach the net in question. The gateway will have forwarded
-
- the datagram, but the host should revise its routing table to have a
-
- different immediate address for this net. The ICMP "destination
-
- unreachable" message indicates that as a result of an outage, it is
-
- currently impossible to reach the addressed net or host in any manner.
-
- On receipt of this message, a host can either abandon the connection
-
- immediately without any further retransmission, or resend slowly to see
-
- if the fault is corrected in reasonable time.
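
The handling of these two advisory messages can be sketched in a few lines of Python. This is an illustration only, not part of any specification: the class HostRouting, its method names, and the string identifiers used for nets and gateways are all hypothetical, and the policy of resending slowly on "destination unreachable" is just one of the two options described above.

```python
from enum import Enum


class Advice(Enum):
    REDIRECT = "redirect"
    DEST_UNREACHABLE = "destination unreachable"


class HostRouting:
    """Hypothetical per-host state: destination net -> first-hop gateway."""

    def __init__(self, default_gateway):
        self.default_gateway = default_gateway
        self.first_hop = {}       # net -> gateway learned from redirects
        self.slow_resend = set()  # nets currently reported unreachable

    def gateway_for(self, net):
        return self.first_hop.get(net, self.default_gateway)

    def on_advice(self, advice, net, better_gateway=None):
        if advice is Advice.REDIRECT:
            # The gateway has already forwarded the datagram; the host
            # only revises its table for future datagrams to this net.
            self.first_hop[net] = better_gateway
        elif advice is Advice.DEST_UNREACHABLE:
            # Assumed policy: keep the connection, but resend slowly.
            self.slow_resend.add(net)
```

A redirect changes the first hop for the one net it names; other nets continue to use whatever gateway they used before.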
-
-
- If a host could assume that these two ICMP messages would always
-
- arrive when something was amiss in the network, then no other action on
-
- the part of the host would be required in order to maintain its tables in
-
- an optimal condition. Unfortunately, there are two circumstances under
-
- which the messages will not arrive properly. First, during the
-
- transient following a failure, error messages may arrive that do not
-
- correctly represent the state of the world. Thus, hosts must take an
-
- isolated error message with some scepticism. (This transient period is
-
- discussed more fully below.) Second, if the host has been sending
-
- datagrams to a particular gateway, and that gateway itself crashes, then
-
- all the other gateways in the internet will reconstruct the topology,
-
- but the gateway in question will still be down, and therefore cannot
-
- provide any advice back to the host. As long as the host continues to
-
- direct datagrams at this dead gateway, the datagrams will simply vanish
-
- off the face of the earth, and nothing will come back in return. Hosts
-
- must detect this failure.
-
-
- If some gateway many hops away fails, this is not of concern to the
-
- host, for then the discovery of the failure is the responsibility of the
-
- immediate neighbor gateways, which will perform this action in a manner
-
- invisible to the host. The problem only arises if the very first
-
- gateway, the one to which the host is immediately sending the datagrams,
-
- fails. We thus identify one single task which the host must perform as
-
- its part of fault isolation in the internet: the host must use some
-
- strategy to detect that a gateway to which it is sending datagrams is
-
- dead.
-
-
- Let us assume for the moment that the host implements some
-
- algorithm to detect failed gateways; we will return later to discuss
-
- what this algorithm might be. First, let us consider what the host
-
- should do when it has determined that a gateway is down. In fact, with
-
- the exception of one small problem, the action the host should take is
-
- extremely simple. The host should select some other gateway, and try
-
- sending the datagram to it. Assuming that gateway is up, this will
-
- either produce correct results, or some ICMP advice. Since we assume
-
- that, ignoring temporary periods immediately following an outage, any
-
- gateway is capable of giving correct advice, once the host has received
-
- advice from any gateway, that host is in as good a condition as it can
-
- hope to be.
-
-
- There is always the unpleasant possibility that when the host tries
-
- a different gateway, that gateway too will be down. Therefore, whatever
-
- algorithm the host uses to detect a dead gateway must continuously be
-
- applied, as the host tries every gateway in turn that it knows about.
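
As a rough sketch of trying every known gateway in turn, the Python function below offers a datagram to each gateway until one shows signs of life. The function name and the try_send callback are hypothetical: try_send stands in for whatever dead-gateway detection the host uses, and is assumed to return True when the gateway either delivers the datagram or returns ICMP advice, and False when the detector gives up on it.

```python
def send_with_failover(gateways, datagram, try_send):
    """Offer the datagram to each known gateway in turn.

    Returns the first gateway for which try_send reports success,
    or None if every known gateway appears to be down.
    """
    for gateway in gateways:
        if try_send(gateway, datagram):
            return gateway
    return None
```

Because the detector is consulted for each candidate, a second dead gateway is skipped in the same way as the first.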
-
-
- The only difficult part of this algorithm is to specify the means
-
- by which the host maintains the table of all of the gateways to which it
-
- has immediate access. Currently, the specification of the internet
-
- protocol does not define any message by which a host can ask to be
-
- supplied with such a table. The reason is that different networks may
-
- provide very different mechanisms by which this table can be filled in.
-
- For example, if the net is a broadcast net, such as an ethernet or a
-
- ringnet, every gateway may simply broadcast such a table from time to
-
- time, and the host need do nothing but listen to obtain the required
-
- information. Alternatively, the network may provide the mechanism of
-
- logical addressing, by which a whole set of machines can be provided
-
- with a single group address, to which a request can be sent for
-
- assistance. Failing those two schemes, the host can build up its table
-
- of neighbor gateways by remembering all the gateways from which it has
-
- ever received a message. Finally, in certain cases, it may be necessary
-
- for this table, or at least the initial entries in the table, to be
-
- constructed manually by a manager or operator at the site. In cases
-
- where the network in question provides absolutely no support for this
-
- kind of host query, at least some manual intervention will be required
-
- to get started, so that the host can find out about at least one
-
- gateway.
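
One possible shape for such a table, assuming manual seeding combined with remembering every gateway heard from, is sketched below in Python. The class name GatewayTable and its methods are hypothetical, and the addresses in the usage note are placeholders.

```python
class GatewayTable:
    """Ordered, duplicate-free list of neighbor gateways."""

    def __init__(self, manual_entries=()):
        self._gateways = []
        for gateway in manual_entries:
            self.learn(gateway)

    def learn(self, gateway):
        """Record a gateway seen in a broadcast, a reply, or a redirect."""
        if gateway not in self._gateways:
            self._gateways.append(gateway)

    def all(self):
        """Return a copy, so callers cannot disturb the table."""
        return list(self._gateways)
```

For instance, GatewayTable(["10.0.0.1"]) starts with a single operator-supplied entry, and each later call to learn() adds a gateway only if it is not already present.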
-
-
- 4. Host Algorithms for Fault Isolation
-
-
- We now return to the question raised above. What strategy should
-
- the host use to detect that it is talking to a dead gateway, so that it
-
- can know to switch to some other gateway in the list? In fact, there are
-
- several algorithms which can be used. All are reasonably simple to
-
- implement, but they have very different implications for the overhead on
-
- the host, the gateway, and the network. Thus, to a certain extent, the
-
- algorithm picked must depend on the details of the network and of the
-
- host.
-
-
- 1. NETWORK LEVEL DETECTION
-
-
- Many networks, particularly the Arpanet, perform precisely the
-
- required function internal to the network. If a host sends a datagram
-
- to a dead gateway on the Arpanet, the network will return a "host dead"
-
- message, which is precisely the information the host needs to know in
-
- order to switch to another gateway. Some early implementations of
-
- Internet on the Arpanet threw these messages away. That is an
-
- exceedingly poor idea.
-
-
- 2. CONTINUOUS POLLING
-
-
- The ICMP protocol provides an echo mechanism by which a host may
-
- solicit a response from a gateway. A host could simply send this
-
- message at a reasonable rate, to assure itself continuously that the
-
- gateway was still up. This works, but, since the message must be sent
-
- fairly often to detect a fault in a reasonable time, it can imply an
-
- unbearable overhead on the host itself, the network, and the gateway.
-
- This strategy is prohibited except where a specific analysis has
-
- indicated that the overhead is tolerable.
-
-
- 3. TRIGGERED POLLING
-
-
- If the use of polling could be restricted to only those times when
-
- something seemed to be wrong, then the overhead would be bearable.
-
- Provided that one can get the proper advice from one's higher level
-
- protocols, it is possible to implement such a strategy. For example,
-
- one could program the TCP level so that whenever it retransmitted a
-
- segment more than once, it sent a hint down to the IP layer which
-
- triggered polling. This strategy does not have excessive overhead, but
-
- does have the problem that the host may be somewhat slow to respond to
-
- an error, since only after polling has started will the host be able to
-
- confirm that something has gone wrong, and by then the TCP above may
-
- have already timed out.
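
A minimal Python sketch of triggered polling follows. It assumes a periodic tick from the IP layer and a send_echo callback that emits one ICMP echo request; the class name, the callback, and both thresholds are hypothetical choices made for illustration.

```python
class TriggeredPoller:
    """Poll the gateway only after TCP hints that something is wrong."""

    def __init__(self, send_echo, retransmit_threshold=2, max_unanswered=3):
        self.send_echo = send_echo
        self.retransmit_threshold = retransmit_threshold
        self.max_unanswered = max_unanswered
        self.polling = False
        self.unanswered = 0

    def on_tcp_retransmit(self, count):
        """Hint from TCP: one segment has now been retransmitted `count` times."""
        if count >= self.retransmit_threshold:
            self.polling = True

    def on_echo_reply(self):
        """The gateway answered, so it is up; stop polling."""
        self.polling = False
        self.unanswered = 0

    def tick(self):
        """Periodic call; returns True once the gateway looks dead."""
        if not self.polling:
            return False
        if self.unanswered >= self.max_unanswered:
            return True
        self.send_echo()
        self.unanswered += 1
        return False
```

No echo is sent until the hint arrives, which is what keeps the overhead low, and also what makes the detection slow.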
-
-
- Both forms of polling suffer from a minor flaw. Hosts as well as
-
- gateways respond to ICMP echo messages. Thus, polling cannot be used to
-
- detect the error that a foreign address thought to be a gateway is
-
- actually a host. Such a confusion can arise if the physical addresses
-
- of machines are rearranged.
-
-
- 4. TRIGGERED RESELECTION
-
-
- There is a strategy which makes use of a hint from a higher level,
-
- as did the previous strategy, but which avoids polling altogether.
-
- Whenever a higher level complains that the service seems to be
-
- defective, the Internet layer can pick the next gateway from the list of
-
- available gateways, and switch to it. Assuming that this gateway is up,
-
- no real harm can come of this decision, even if it was wrong, for the
-
- worst that will happen is a redirect message which instructs the host to
-
- return to the gateway originally being used. If, on the other hand, the
-
- original gateway was indeed down, then this immediately provides a new
-
- route, so the period of time until recovery is shortened. This last
-
- strategy seems particularly clever, and is probably the most generally
-
- suitable for those cases where the network itself does not provide fault
-
- isolation. (Regrettably, I have forgotten who suggested this idea to me.
-
- It is not my invention.)
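
Triggered reselection might be sketched in Python as below. The class and method names are hypothetical, the gateway list is assumed to be non-empty, and for brevity the sketch keeps a single current gateway rather than one per destination net.

```python
class GatewaySelector:
    """Switch gateways on a complaint from above; no polling at all."""

    def __init__(self, gateways):
        self.gateways = list(gateways)  # assumed non-empty
        self.index = 0

    @property
    def current(self):
        return self.gateways[self.index]

    def on_complaint(self):
        """A higher level reports defective service: take the next gateway."""
        self.index = (self.index + 1) % len(self.gateways)
        return self.current

    def on_redirect(self, gateway):
        """A live gateway names a better first hop: believe it."""
        if gateway not in self.gateways:
            self.gateways.append(gateway)
        self.index = self.gateways.index(gateway)
```

If the complaint was a false alarm, the redirect from the newly chosen gateway simply puts the host back where it started.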
-
-
- 5. Higher Level Fault Detection
-
-
- The previous discussion has concentrated on fault detection and
-
- recovery at the IP layer. This section considers what the higher layers
-
- such as TCP should do.
-
-
- TCP has a single fault recovery action; it repeatedly retransmits a
-
- segment until either it gets an acknowledgement or its connection timer
-
- expires. As discussed above, it may use retransmission as an event to
-
- trigger a request for fault recovery to the IP layer. In the other
-
- direction, information may flow up from IP, reporting such things as
-
- ICMP Destination Unreachable or error messages from the attached
-
- network. The only subtle question about TCP and faults is what TCP
-
- should do when such an error message arrives or its connection timer
-
- expires.
-
-
- The TCP specification discusses the timer. In the description of
-
- the open call, the timeout is described as an optional value that the
-
- client of TCP may specify; if any segment remains unacknowledged for
-
- this period, TCP should abort the connection. The default for the
-
- timeout is 30 seconds. Early TCPs were often implemented with a fixed
-
- timeout interval, but this did not work well in practice, as the
-
- following discussion may suggest.
-
-
- Clients of TCP can be divided into two classes: those running on
-
- immediate behalf of a human, such as Telnet, and those supporting a
-
- program, such as a mail sender. Humans require a sophisticated response
-
- to errors. Depending on exactly what went wrong, they may want to
-
- abandon the connection at once, or wait for a long time to see if things
-
- get better. Programs do not have this human impatience, but also lack
-
- the power to make complex decisions based on details of the exact error
-
- condition. For them, a simple timeout is reasonable.
-
-
- Based on these considerations, at least two modes of operation are
-
- needed in TCP. One, for programs, abandons the connection without
-
- exception if the TCP timer expires. The other mode, suitable for
-
- people, never abandons the connection on its own initiative, but reports
-
- to the layer above when the timer expires. Thus, the human user can see
-
- error messages coming from all the relevant layers, TCP and ICMP, and
-
- can request TCP to abort as appropriate. This second mode requires that
-
- TCP be able to send an asynchronous message up to its client to report
-
- the timeout, and it requires that error messages arriving at lower
-
- layers similarly flow up through TCP.
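
The two modes can be illustrated with a small Python sketch. Nothing here is taken from the TCP specification: the names TimeoutMode and Connection, the notify upcall, and the message strings are hypothetical, and a real implementation would deliver the upcall asynchronously.

```python
from enum import Enum


class TimeoutMode(Enum):
    PROGRAM = "abort when the timer expires"
    HUMAN = "report when the timer expires"


class Connection:
    def __init__(self, mode, notify):
        self.mode = mode
        self.notify = notify  # upcall to the client of TCP
        self.open = True

    def on_timer_expired(self):
        if self.mode is TimeoutMode.PROGRAM:
            self.open = False  # abandon without exception
            self.notify("aborted: timer expired")
        else:
            self.notify("timer expired")  # the client decides what to do

    def on_lower_layer_error(self, message):
        # ICMP and network errors flow up; they never abort by themselves.
        self.notify(message)

    def abort(self):
        """Explicit request from the client, e.g. an impatient human."""
        self.open = False
```

In the human mode the connection stays open until abort() is called, so the user sees both the timeout and any lower-layer messages before deciding.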
-
-
- At levels above TCP, fault detection is also required. Either of
-
- the following can happen. First, the foreign client of TCP can fail,
-
- even though TCP is still running, so data is still acknowledged and the
-
- timer never expires. Alternatively, the communication path can fail,
-
- without the TCP timer going off, because the local client has no data to
-
- send. Both of these have caused trouble.
-
-
- Sending mail provides an example of the first case. When sending
-
- mail using SMTP, there is an SMTP level acknowledgement that is returned
-
- when a piece of mail is successfully delivered. Several early mail
-
- receiving programs would crash just at the point where they had received
-
- all of the mail text (so TCP did not detect a timeout due to outstanding
-
- unacknowledged data) but before the mail was acknowledged at the SMTP
-
- level. This failure would cause early mail senders to wait forever for
-
- the SMTP level acknowledgement. The obvious cure was to set a timer at
-
- the SMTP level, but the first attempt to do this did not work, for there
-
- was no simple way to select the timer interval. If the interval
-
- selected was short, it expired in normal operation when sending a
-
- large file to a slow host. An interval of many minutes was needed to
-
- prevent false timeouts, but that meant that failures were detected only
-
- very slowly. The current solution in several mailers is to pick a
-
- timeout interval proportional to the size of the message.
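
A size-proportional timeout of this kind could look like the Python function below. The function name and both constants are placeholders chosen for illustration, not values taken from any mailer, and the lower bound is an added assumption so that very small messages still get a usable interval.

```python
def smtp_ack_timeout(message_bytes, seconds_per_kilobyte=2.0, floor_seconds=60.0):
    """Seconds to wait for the SMTP-level acknowledgement of one message."""
    proportional = seconds_per_kilobyte * message_bytes / 1024
    return max(floor_seconds, proportional)
```

With these placeholder constants a 1 KB message waits the 60 second floor, while a 1000 KB message waits 2000 seconds.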
-
-
- Server telnet provides an example of the other kind of failure. It
-
- can easily happen that the communications link fails while there is
-
- no traffic flowing, perhaps because the user is thinking. Eventually,
-
- the user will attempt to type something, at which time he will discover
-
- that the connection is dead and abort it. But the host end of the
-
- connection, having nothing to send, will not discover anything wrong,
-
- and will remain waiting forever. In some systems there is no way for a
-
- user in a different process to destroy or take over such a hanging
-
- process, so there is no way to recover.
-
-
- One solution to this would be to have the host server telnet query
-
- the user end now and then, to see if it is still up. (Telnet does not
-
- have an explicit query feature, but the host could negotiate some
-
- unimportant option, which should produce either agreement or
-
- disagreement in return.) The only problem with this is that a
-
- reasonable sample interval, if applied to every user on a large system,
-
- can generate an unacceptable amount of traffic and system overhead. A
-
- smart server telnet would use this query only when something seems
-
- wrong, perhaps when there had been no user activity for some time.
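
The decision of when a "smart" server telnet should query the user end might be sketched as follows in Python. The function name and both intervals are hypothetical, and all times are assumed to be seconds on a common clock.

```python
def should_query_user_end(now, last_user_activity, last_query,
                          idle_threshold=600.0, query_interval=120.0):
    """Query only idle users, and no more often than query_interval."""
    if now - last_user_activity < idle_threshold:
        return False  # the user is active, so the path is evidently up
    return now - last_query >= query_interval
```

Active users generate no query traffic at all, so the cost scales with the number of idle connections rather than with the total number of users.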
-
-
- In both these cases, the general conclusion is that client level
-
- error detection is needed, and that the details of the mechanism are
-
- very dependent on the application. Application programmers must be made
-
- aware of the problem of failures, and must understand that error
-
- detection at the TCP or lower level cannot solve the whole problem for
-
- them.
-
-
- 6. Knowing When to Give Up
-
-
- It is not obvious, when error messages such as ICMP Destination
-
- Unreachable arrive, whether TCP should abandon the connection. The
-
- reason that error messages are difficult to interpret is that, as
-
- discussed above, after a failure of a gateway or network, there is a
-
- transient period during which the gateways may have incorrect
-
- information, so that irrelevant or incorrect error messages may
-
- sometimes return. An isolated ICMP Destination Unreachable may arrive
-
- at a host, for example, if a packet is sent during the period when the
-
- gateways are trying to find a new route. To abandon a TCP connection
-
- based on such a message arriving would be to ignore the valuable feature
-
- of the Internet that for many internal failures it reconstructs its
-
- function without any disruption of the end points.
-
-
- But if failure messages do not imply a failure, what are they for?
-
- In fact, error messages serve several important purposes. First, if
-
- they arrive in response to opening a new connection, they probably are
-
- caused by opening the connection improperly (e.g., to a non-existent
-
- address) rather than by a transient network failure. Second, they
-
- provide valuable information, after the TCP timeout has occurred, as to
-
- the probable cause of the failure. Finally, certain messages, such as
-
- ICMP Parameter Problem, imply a possible implementation problem. In
-
- general, error messages give valuable information about what went wrong,
-
- but are not to be taken as absolutely reliable. A general alerting
-
- mechanism, such as the TCP timeout discussed above, provides a good
-
- indication that whatever is wrong is a serious condition, but without
-
- the advisory messages to augment the timer, there is no way for the
-
- client to know how to respond to the error. The combination of the
-
- timer and the advice from the error messages provides a reasonable set of
-
- facts for the client layer to have. It is important that error messages
-
- from all layers be passed up to the client module in a useful and
-
- consistent way.
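
One way to combine the timer with the advisory messages is sketched below in Python. The class ErrorAdvisor, its return values, and the message strings are hypothetical; the sketch only illustrates the rule that an error during opening is probably real, while an isolated error on an established connection is remembered but not acted upon until the timer expires.

```python
class ErrorAdvisor:
    """Remember advisory errors; let the timer decide when they matter."""

    def __init__(self):
        self.last_error = None

    def on_error(self, message, opening=False):
        self.last_error = message
        if opening:
            # Probably an improper open, e.g. a non-existent address.
            return "report-open-failure"
        # Possibly a transient during rerouting: note it and carry on.
        return "keep-going"

    def on_timeout(self):
        """Pair the alert with the most recent advice as its probable cause."""
        return ("timeout", self.last_error)
```

The client layer then receives both facts together: that something serious has happened, and the best available hint as to what it was.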
-
-